====== Duplicate Finder ======

===== Installation =====

Use the Module Management Tool to install the Duplicate Finder AMP (duplicates.amp). Information about how to do this can be found in the [[http://wiki.alfresco.com/wiki/Module_Management_Tool|Module Management Tool Guide]]. The Module Management Tool can be downloaded [[http://sourceforge.net/project/showfiles.php?group_id=143373&package_id=157460&release_id=524558|here]].

===== How it works =====

To compare documents, a hashsum is created for each document and saved in the document properties. For this purpose a duplicates aspect with the properties hashsum and duplicate count is assigned to the document. This is done by a behaviour that updates the hashsum each time the content of the document changes. Each time the hashsum is updated, the number of duplicates is counted again.

===== Usage =====

As soon as the duplicates aspect is assigned to a document, you can see the number of duplicates in the document's details. To try it out, make a copy of a document that has the duplicates aspect: the number of duplicates should now be other than 0. Additionally, there is a blue arrow on each document that has duplicates in the web client's folder view. The blue arrow can also be seen in the document's details view. Click the arrow button to see all of the document's duplicates.

===== Initially assign the duplicates aspect to all documents =====

Through behaviours, the duplicates aspect (hashsum and duplicate count) is automatically assigned to a document as soon as it is created or its content is changed. But if you install the Duplicate Finder in a database that already holds lots of documents, all these documents have to be processed by the Duplicate Finder once. This chapter describes how to do this initial processing of all documents already in the database. If you install the Duplicate Finder in a fresh database, you can skip this chapter.

To initially process all documents in the database, you have to start the UpdateDuplicatesAction as a scheduled action.
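As an aside on the hashsum comparison described under "How it works": the idea can be pictured with the minimal Java sketch below. It is not the Duplicate Finder's actual code. The class name ''DuplicateSketch'', the method names, the use of SHA-256 and the in-memory map of documents are all assumptions made for illustration; the page does not say which hash algorithm the module uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: NOT the Duplicate Finder's actual implementation.
class DuplicateSketch {

    // Hex-encoded SHA-256 digest of the content (the algorithm is an assumption).
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // For each document name, count the OTHER documents with the same hashsum.
    static Map<String, Integer> countDuplicates(Map<String, byte[]> docs) {
        Map<String, Integer> docsPerHash = new HashMap<>();
        Map<String, String> hashOfDoc = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : docs.entrySet()) {
            String hash = sha256Hex(e.getValue());
            hashOfDoc.put(e.getKey(), hash);
            docsPerHash.merge(hash, 1, Integer::sum);
        }
        Map<String, Integer> duplicateCount = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : hashOfDoc.entrySet()) {
            // Subtract one so that a document does not count as its own duplicate.
            duplicateCount.put(e.getKey(), docsPerHash.get(e.getValue()) - 1);
        }
        return duplicateCount;
    }

    public static void main(String[] args) {
        Map<String, byte[]> docs = new LinkedHashMap<>();
        docs.put("report.txt", "quarterly figures".getBytes(StandardCharsets.UTF_8));
        docs.put("copy of report.txt", "quarterly figures".getBytes(StandardCharsets.UTF_8));
        docs.put("notes.txt", "something else".getBytes(StandardCharsets.UTF_8));
        // Prints {report.txt=1, copy of report.txt=1, notes.txt=0}
        System.out.println(countDuplicates(docs));
    }
}
```

Two documents with identical content get the same hashsum, so each reports one duplicate, while the third document reports none. This mirrors the Usage example of copying a document and seeing a duplicate count other than 0.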
Setting up the scheduled action is quite simple:

  * Install the duplicates.amp into your alfresco.war (see Installation above).
  * Go to the <TOMCAT_ROOT>/shared/classes/alfresco/extension folder.
  * Create a new file named scheduled-action-services-context.xml.
  * Put the following lines of code into this file: <code>
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>

    <!-- Define the model factory used to generate object models suitable for use with freemarker templates. -->
    <bean id="templateActionModelFactory" class="org.alfresco.repo.action.scheduled.FreeMarkerWithLuceneExtensionsModelFactory">
        <property name="serviceRegistry">
            <ref bean="ServiceRegistry"/>
        </property>
    </bean>

    <!-- An action that adds the duplicates aspect, the hashsum and the duplicates count to a node. -->
    <bean id="addDuplicatesTemplateActionDef" class="org.alfresco.repo.action.scheduled.SimpleTemplateActionDefinition">
        <property name="actionName">
            <value>updateDuplicatesAction</value>
        </property>
        <property name="parameterTemplates">
            <map>
            </map>
        </property>
        <property name="templateActionModelFactory">
            <ref bean="templateActionModelFactory" />
        </property>
        <property name="dictionaryService">
            <ref bean="DictionaryService" />
        </property>
        <property name="actionService">
            <ref bean="ActionService" />
        </property>
        <property name="templateService">
            <ref bean="TemplateService" />
        </property>
    </bean>

    <!-- The cron-triggered job that runs the action on every node returned by the query. -->
    <bean id="addDuplicatesCron" class="org.alfresco.repo.action.scheduled.CronScheduledQueryBasedTemplateActionDefinition">
        <property name="transactionMode">
            <value>ISOLATED_TRANSACTIONS</value>
        </property>
        <property name="compensatingActionMode">
            <value>IGNORE</value>
        </property>
        <property name="searchService">
            <ref bean="SearchService" />
        </property>
        <property name="templateService">
            <ref bean="TemplateService" />
        </property>
        <property name="queryLanguage">
            <value>lucene</value>
        </property>
        <property name="stores">
            <list>
                <value>workspace://SpacesStore</value>
            </list>
        </property>
        <!-- Select all nodes except folders, links, categories and other non-document types -->
        <property name="queryTemplate">
            <value>PATH:"//\*" -TYPE:"{http://www.alfresco.org/model/content/1.0}systemfolder" -TYPE:"{http://www.alfresco.org/model/content/1.0}folder" -TYPE:"{http://www.alfresco.org/model/application/1.0}folderlink" -TYPE:"{http://www.alfresco.org/model/application/1.0}filelink" -TYPE:"{http://www.alfresco.org/model/content/1.0}category_root" -TYPE:"{http://www.alfresco.org/model/content/1.0}category" -TYPE:"{http://www.alfresco.org/model/content/1.0}dictionaryModel" -TYPE:"{http://www.alfresco.org/model/content/1.0}link" -TYPE:"{http://www.alfresco.org/model/content/1.0}person" -TYPE:"{http://www.alfresco.org/model/action/1.0}actioncondition" -TYPE:"{http://www.alfresco.org/model/content/1.0}authorityContainer" -TYPE:"{http://www.alfresco.org/model/rule/1.0}rule"</value>
        </property>
        <!-- Start time of the job: 23:15 every day -->
        <property name="cronExpression">
            <value>0 15 23 * * ?</value>
        </property>
        <property name="jobName">
            <value>jobDuplicates</value>
        </property>
        <property name="jobGroup">
            <value>jobGroup</value>
        </property>
        <property name="triggerName">
            <value>triggerDuplicates</value>
        </property>
        <property name="triggerGroup">
            <value>triggerGroup</value>
        </property>
        <property name="scheduler">
            <ref bean="schedulerFactory" />
        </property>
        <property name="actionService">
            <ref bean="ActionService" />
        </property>
        <property name="templateActionModelFactory">
            <ref bean="templateActionModelFactory" />
        </property>
        <property name="templateActionDefinition">
            <ref bean="addDuplicatesTemplateActionDef" />
        </property>
        <property name="transactionService">
            <ref bean="TransactionService" />
        </property>
        <property name="runAsUser">
            <value>System</value>
        </property>
    </bean>

</beans>
</code>
  * Have a look at these lines: <code>
<property name="cronExpression">
    <value>0 15 23 * * ?</value>
</property>
</code> That's where the start time for the action must be given. In my example the action will start at 15 minutes past 11pm. To learn more about cron expressions, have a look at [[http://wiki.alfresco.com/wiki/Scheduled_Actions#Cron_Explained|Scheduled Actions - Alfresco Wiki]].
  * Start Alfresco. The processing of the nodes should start in the background at the given time.

If you want to get some information about the processing done by the update duplicates action, set the logger for the action to info. To do so, add this line

<code>
log4j.logger.de.hmedia.alfresco.actions=info
</code>

at the end of your <TOMCAT_ROOT>/webapps/alfresco/WEB-INF/classes/log4j.properties file.
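To make the cron expression easier to adapt, here is a small Java sketch that labels the fields of an expression such as ''0 15 23 * * ?''. It assumes the Quartz-style field order (seconds, minutes, hours, day of month, month, day of week, optional year); the class and method names are hypothetical and are not part of Alfresco or the Duplicate Finder. The sketch only splits and labels the fields and does not validate their values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper only: labels the fields of a Quartz-style cron expression.
class CronFields {

    // Field order assumed from the Quartz cron format; the seventh field (year) is optional.
    static final String[] NAMES = {
        "seconds", "minutes", "hours", "day-of-month", "month", "day-of-week", "year"
    };

    static Map<String, String> describe(String expression) {
        String[] parts = expression.trim().split("\\s+");
        if (parts.length < 6 || parts.length > 7) {
            throw new IllegalArgumentException("Expected 6 or 7 fields, got " + parts.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            fields.put(NAMES[i], parts[i]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Prints {seconds=0, minutes=15, hours=23, day-of-month=*, month=*, day-of-week=?}
        System.out.println(describe("0 15 23 * * ?"));
    }
}
```

Read this way, the example expression means second 0 of minute 15 of hour 23, on every day of every month, which is 11:15pm daily. To run the initial processing at another time, change the minutes and hours fields accordingly.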

products/duplicates.txt · Last modified: 2023/11/19 22:46 (external edit)